Abstract. The probability that two spatial objects establish some kind
of mutual connection often depends on their proximity. To formalize this
concept, we define the notion of a probabilistic neighborhood: Let P be
a set of n points in $\mathbb{R}^d$, $q \in \mathbb{R}^d$ a query point, dist a distance metric,
and $f : \mathbb{R}^+ \to [0,1]$ a monotonically decreasing function. Then, the
probabilistic neighborhood N(q, f) of q with respect to f is a random
subset of P and each point p ∈ P belongs to N(q, f) with probability
f(dist(p, q)). Possible applications include query sampling and the
simulation of probabilistic spreading phenomena, as well as other scenarios
where the probability of a connection between two entities decreases
with their distance. We present a fast, sublinear-time query algorithm to
sample probabilistic neighborhoods from planar point sets. For certain
distributions of planar P, we prove that our algorithm answers a query
in $O((|N(q, f)| + \sqrt{n}) \log n)$ time with high probability. In experiments
this yields a speedup over pairwise distance probing of at least one order 
of magnitude, even for rather small data sets with n = 10® and also for 
other point distributions not covered by the theoretical results. 


1 Introduction 

In many scenarios, connections between spatial objects are not certain but 
probabilistic, with the probability depending on the distance between them: 
The probability that a customer shops at a certain physical store shrinks with 
increasing distance to it. In disease simulations, if the social interaction graph 
is unknown but locations are available, disease transmission can be modeled as 
a random process with infection risk decreasing with distance. Moreover, the 
wireless connections between units in an ad-hoc network are fragile and collapse 
more frequently with higher distance. 

For these and similar scenarios, we define the notion of a probabilistic
neighborhood in spatial data sets: Let a set P of n points in $\mathbb{R}^d$, a query point
$q \in \mathbb{R}^d$, a distance metric dist, and a monotonically decreasing function
$f : \mathbb{R}^+ \to [0,1]$ be given. Then, the probabilistic neighborhood N(q, f) of q
with respect to f is a random subset of P and each point p ∈ P belongs to N(q, f)
with probability f(dist(p, q)). A straightforward query algorithm for sampling a
probabilistic neighborhood would iterate over each point p ∈ P and sample for each
whether it is included in N(q, f). This has a running time of $O(n \cdot d)$ per query
point, which is prohibitive for repeated queries in large data sets. Thus we are interested in
a faster algorithm for such a probabilistic neighborhood query (PNQ, spoken as 
“pink”). We restrict ourselves to the planar case in this work, but the algorithmic 
principle is generalizable to higher dimensions. 
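
As a point of reference, the straightforward linear-time query described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `naive_pnq` and the tuple representation of points are assumptions made for this example.

```python
import math
import random

def naive_pnq(points, q, f, rng=random):
    """Sample N(q, f) by probing every point: one distance test per point."""
    # Each point p is kept independently with probability f(dist(p, q)).
    return [p for p in points if rng.random() < f(math.dist(p, q))]

# Hypothetical usage with a step function, which makes the query deterministic.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
step = lambda d: 1.0 if d <= 2.0 else 0.0
result = naive_pnq(points, (0.0, 0.0), step)
```

Every query touches all n points, which is the cost the index structure in the following sections is designed to avoid.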

While the linear-time approach has appeared before in the literature for a 
particular application [5] (without formulating the problem as a PNQ explicitly), 
we are not aware of previous work performing more efficient PNQs with an index 
structure. For example, the probabilistic quadtree introduced by Kraetzschmar et 
al. [12] is designed to store probabilistic occupancy data and gives deterministic 
results. Other range queries related to (yet different from) our work as well as
deterministic index structures are described in Section 2.2.


Contributions. We develop, analyze, implement, and evaluate an index structure 
and a query algorithm that together provide fast probabilistic neighborhood 
queries in the Euclidean and hyperbolic plane. Our key data structure for these
fast PNQs is a polar quadtree which we adapt from our previous work [19].
Preprocessing for quadtree construction requires $O(n \log n)$ time with high
probability¹ (whp).

To answer PNQs, we first present a simple query algorithm (Section 3). We
then improve its time complexity by treating whole subtrees as so-called virtual
leaves, see Section 4. As shown by our detailed theoretical analysis, the improved
algorithm yields a query time complexity of $O((|N(q, f)| + \sqrt{n}) \log n)$ whp to
find a probabilistic neighborhood N(q, f) among n points, for n sufficiently large.
This is sublinear if the returned neighborhood N(q, f) is of size $o(n / \log n)$, an
assumption we consider reasonable for most applications. For our theoretical
results to hold, the quadtree structure needs to be able to partition the distribution 
of the point positions in P, i. e. not all of the probability mass may be concentrated 
on a single point or line. In our case of polar quadtrees, this is achieved if the 
distribution is continuous, integrable, rotationally invariant with respect to the 
origin and non-zero only for a finite area. 

Experimental results are shown in Section 5. We apply our query algorithm to
generate random graphs in the hyperbolic plane [14] in subquadratic time. Graphs
with millions of edges can now be generated within a few minutes sequentially.
This yields an acceleration of at least one order of magnitude in practice compared
to a reference implementation [2] that uses linear-time queries. Compared to our
previous work on graph generation [19], our new algorithm is able to generate a
more extensive model. Even if the distribution of a given point set P is unknown
in practice, running times are fast: As an example of probabilistic spreading 
behavior, we simulate a simple disease spreading mechanism on real population 
density geodata. In this scenario, our fast PNQs are at least two orders of 
magnitude faster than linear-time queries. 


¹ We say “with high probability” (whp) when referring to a probability > 1 - 1/n for
sufficiently large n.
2 Preliminaries


2.1 Notation 

Let the input be given as a set P of n points. The points in P are distributed in
a disk $D_R$ of radius R in the hyperbolic or Euclidean plane; the distribution
is given by a probability density function $j(\phi, r)$ for an angle $\phi$ and a radius
r. Recall that, for our theoretical results to hold, we require j to be known,
continuous and integrable. Furthermore, j needs to be rotationally invariant,
meaning that $j(\phi_1, r) = j(\phi_2, r)$ for any radius r and any two angles $\phi_1$ and
$\phi_2$, and positive within $D_R$, so that $j(r) > 0 \Leftrightarrow r < R$. Due to the rotational
invariance, $j(\phi, r)$ is the same for every $\phi$ and we can write j(r). Likewise, we
define J(r) as the indefinite integral of j(r) and normalize it so that J(R) = 1
(also implying J(0) = 0). The value J(r) then gives the fraction of probability
mass inside radius r.
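
As a worked illustration of this notation, consider points distributed uniformly on a Euclidean disk of radius R. This distribution is an assumption chosen for the example (the text does not single it out), and its radial density vanishes at the single point r = 0, so it meets the positivity requirement only for 0 < r < R.

```latex
% Assumed example: uniform distribution on a Euclidean disk of radius R.
j(r) = \frac{2r}{R^2} \quad (0 \le r \le R), \qquad
J(r) = \int_0^r j(s)\,\mathrm{d}s = \frac{r^2}{R^2}, \qquad
J(0) = 0, \quad J(R) = 1 .
% Half of the probability mass lies inside radius R/\sqrt{2}:
J\!\left(R/\sqrt{2}\right) = \tfrac{1}{2} .
```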

For the distance between two points $p_1$ and $p_2$, we use $\mathrm{dist}_H(p_1, p_2)$ for the
hyperbolic and $\mathrm{dist}_E(p_1, p_2)$ for the Euclidean case. We may omit the index
if a distinction is unnecessary. As mentioned, a point p is in the probabilistic
neighborhood of query point q with probability f(dist(p, q)). Thus, a query pair
consists of a query point q and a function $f : \mathbb{R}^+ \to [0,1]$ that maps distances
to probabilities. The function f needs to be monotonically decreasing but may
be discontinuous. Note that f can be defined differently for each query. The
query result, the probabilistic neighborhood of q w.r.t. f, is denoted by the set
N(q, f) ⊆ P.

For the algorithm analysis, we use two additional sets for each query (q, f):

- Candidates(q, f): neighbor candidates examined when executing such a query,

- Cells(q, f): quadtree cells examined during execution of the query.

Note that the sets N(q, f), Candidates(q, f) and Cells(q, f) are probabilistic, thus
theoretical results about their size usually hold only with high probability.


2.2 Related Work 

Fast deterministic range queries. Numerous index structures for fast range
queries on spatial data exist. Many such index structures are based on trees
or variations thereof, see Samet’s book [17] for a comprehensive overview.
I/O-efficient worst-case analysis is usually performed using the EM model, see e.g. [3].
In more applied settings, average-case performance is of higher importance, which
popularized R-trees or newer variants thereof. Concerning (balanced)
quadtrees for spatial dimension d, it is known that queries require $O(d \cdot n^{1-1/d})$
time (thus $O(\sqrt{n})$ in the planar case) [17, Ch. 1.4]. Regarding PNQs, our algorithm
matches this query complexity up to a logarithmic factor. Yet note that, since
for general f and dist in our scenario all points in the set P could be neighbors,
data structures for deterministic queries cannot solve a PNQ efficiently without
adaptations.
Hu et al. [10] give a query sampling algorithm for one-dimensional data that,
given a set P of n points in $\mathbb{R}$, an interval q = [x, y] and an integer t ≥ 1, returns
t elements uniformly sampled from P ∩ q. They describe a structure of O(n) space
that answers a query in $O(\log n + t)$ time and supports updates in $O(\log n)$ time.
While also offering query sampling, PNQs differ from the problem considered 
by Hu et al. in two aspects: We consider two dimensions instead of one and our 
sampling probabilities are not necessarily uniform, but can be set by the user by 
a distance-dependent function. 

Range queries on uncertain data. During the previous decade, probabilistic queries
different from PNQs have become popular. The main scenarios can be put into
two categories: (i) Probabilistic databases contain entries that come with
a specified confidence (e.g. sensor data whose accuracy is uncertain) and (ii)
objects with an uncertain location, i.e. the location is specified by a probability
distribution. Both scenarios differ under typical and reasonable assumptions from
ours: Queries for uncertain data are usually formulated to return all points in
the neighborhood whose confidence/probability exceeds a certain threshold [15],
or to compute points that are possibly nearest neighbors [1].

In our model, in turn, the choice of inclusion of a point p is a random choice 
for every different p. In particular, depending on the probability distribution, 
all nodes in the plane can have positive probability to be part of some other’s 
neighborhood. In the related scenarios this would only be true with extremely 
small confidence values or extremely large query circles. 

Applications in fast graph generation. One application for PNQs as introduced
in Section 1 is the hyperbolic random graph model by Krioukov et al. [14].
The n graph nodes are represented by points thrown into the hyperbolic plane
at random² and two nodes are connected by an edge with a probability that
decreases with the distance between them. An implementation of this generative
model is available [2]; it performs $O(n^2)$ neighborhood tests. Bringmann et al.
provide an algorithm to generate hyperbolic random graphs in expected linear
time [6]; to our knowledge no implementation of it exists yet.

In previous work we designed a generator [19] faster than [2] for a restricted
model; it runs in $O((n^{3/2} + m) \log n)$ time whp for the whole graph with m edges.

The range queries discussed there are facilitated by a quadtree which supports 
only deterministic queries. Consequently, the queries result in unit-disk graphs 
in the hyperbolic plane and can be considered as a special case of the current 
work (a step function f with values 0 and 1 results in a deterministic query).

Our major technical inspiration for enhancing the quadtree for probabilistic
neighborhoods is the work of Batagelj and Brandes [5]. They were the first to
present a random sampling method to generate Erdős-Rényi graphs with n nodes
and m edges in $O(n + m)$ time complexity. Faced with a similar problem of selecting
each of n elements with a constant probability p, they designed an efficient
algorithm (see Algorithm 2 in Appendix A). Instead of sampling each element
separately, they use random jumps of length $\delta(p)$, $\delta(p) = \ln(1 - \mathrm{rand}) / \ln(1 - p)$,
with rand being a random number uniformly distributed in [0,1).

² The probability density in the polar model depends only on radii r and R as well as
a growth parameter $\alpha$ and is given by $g(r) := \alpha \sinh(\alpha r) / (\cosh(\alpha R) - 1)$.
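
The jump-based selection can be sketched as follows. This is a minimal illustration of the skipping idea under the assumption that the jump length is rounded down to an integer number of skipped elements; the function name `skip_sample` is hypothetical and not taken from the cited work.

```python
import math
import random

def skip_sample(n, p, rng=random):
    """Return indices from range(n), each included independently with probability p."""
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    chosen = []
    log_q = math.log(1.0 - p)  # negative, since 0 < p < 1
    i = -1
    while True:
        # Geometrically distributed number of skipped elements before the next hit.
        i += 1 + int(math.log(1.0 - rng.random()) / log_q)
        if i >= n:
            return chosen
        chosen.append(i)
```

The expected number of loop iterations is proportional to the number of selected elements rather than to n, which is what makes the technique attractive when p is small.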

2.3 Quadtree Specifics 

Our key data structure is a polar region quadtree in the Euclidean or hyperbolic 
plane. While quadtrees are less suited to higher dimensions than, for example,
k-d-trees, their complexity is comparable in the plane. For the (circular) range queries we
discuss, quadtrees have the significant advantage of a bounded aspect ratio: A cell 
in a k-d-tree might extend arbitrarily far in one direction, rendering theoretical 
guarantees about the area affected by the query circle difficult to impossible. In 
contrast, the region covered by a quadtree cell is determined by its position and 
level. 

We mostly reuse our previous definition [19] of the quadtree: A node in the
quadtree is defined as a tuple $(\min_\phi, \max_\phi, \min_r, \max_r)$ with $\min_\phi < \max_\phi$ and
$\min_r < \max_r$. It is responsible for a point $p = (\phi_p, r_p)$ exactly if
$\min_\phi \le \phi_p < \max_\phi$ and $\min_r \le r_p < \max_r$. We call the region represented by a particular
quadtree node its quadtree cell. The quadtree is parametrized by its radius R,
the $\max_r$ of the root cell. If the probability distribution j is known (which we
assume for our theoretical results), we set the radius R to $\arg\min_r \{J(r) = 1\}$,
i.e. to the minimum radius that contains the full probability mass. If only the
points are known, the radius is set to include all of them. While in this latter
case the complexity analysis of Sections 3 and 4 does not hold, fast running times
in practice can still be achieved (see Section 5).

3 Baseline Query Algorithm 

We begin the main technical part by describing adaptations in the quadtree 
construction as well as a baseline query algorithm. This latter algorithm introduces 
the main idea, but is asymptotically not faster than the straightforward approach. 
In Section 4, it is then refined to support faster queries.

3.1 Quadtree Construction 

At each quadtree node v, we store the size of the subtree rooted there. We then 
generalize the rule for node splitting to handle point distributions j as defined in 
Section 2.1. As is usual for quadtrees, a leaf cell c is split into four children when
it exceeds its fixed capacity. Since our quadtree is polar, this split happens once
in the angular and once in the radial direction. Due to the rotational symmetry
of j, splitting in the angular direction is straightforward as the angle range is
halved: $\mathrm{mid}_\phi := (\max_\phi + \min_\phi)/2$. For the radial direction, we choose the splitting
radius to result in an equal division of probability mass. The total probability
mass in a ring delimited by $\min_r$ and $\max_r$ is $J(\max_r) - J(\min_r)$. Since j(r) is
positive for r between 0 and R, the restricted function $J|_{[0,R]}$ defined above is a
bijection. The inverse $(J|_{[0,R]})^{-1}$ thus exists and we set the splitting radius $\mathrm{mid}_r$
to $(J|_{[0,R]})^{-1}\left(\frac{J(\max_r) + J(\min_r)}{2}\right)$.
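
The splitting rule (halve the angle range, split the radius so that both rings carry equal probability mass) can be sketched as below. The example assumes points uniform on a Euclidean disk, for which J(r) = r²/R² and its inverse is R·√y; this choice of J and the function name `split_cell` are assumptions made for illustration.

```python
import math

def split_cell(min_phi, max_phi, min_r, max_r, J, J_inv):
    """Return (mid_phi, mid_r) for a polar quadtree cell."""
    mid_phi = (max_phi + min_phi) / 2.0
    # Radial split: the inner and outer ring each get half of the cell's mass.
    mid_r = J_inv((J(max_r) + J(min_r)) / 2.0)
    return mid_phi, mid_r

# Assumed example distribution: uniform on a Euclidean disk of radius R.
R = 2.0
J = lambda r: (r / R) ** 2
J_inv = lambda y: R * math.sqrt(y)

mid_phi, mid_r = split_cell(0.0, 2.0 * math.pi, 0.0, R, J, J_inv)
```

For the root cell of this example the radial split lands at R/√2 rather than R/2, because the outer part of a Euclidean disk holds more area per unit of radius.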




Figure 1 visualizes a point distribution on a hyperbolic disk with 200 points and
Figure 2 its corresponding quadtree.

Two results on quadtree properties help to establish the time complexity of
quadtree operations. They are generalized versions of our previous work
[19, Lemmas 1 and 2] and state that each quadtree cell contains the same expected
number of points and that the quadtree height is $O(\log n)$ whp (proofs in Appendix B).

Lemma 1. Let $D_R$ be a hyperbolic or Euclidean disk of radius R, j a probability
distribution on $D_R$ which fulfills the properties defined in Section 2.1, p a point
in $D_R$ which is sampled from j, and T be a polar quadtree on $D_R$. Let C be a
quadtree cell at depth i. Then, the probability that p is in C is $4^{-i}$.

Lemma 2. Just as in Lemma 1, let $D_R$ be a hyperbolic or Euclidean disk of radius R,
j a probability distribution on $D_R$ which fulfills the properties defined in
Section 2.1 and T be a polar quadtree on $D_R$. The expected number of nodes in T is
then in $O(n)$.

Proposition 1. Let $D_R$ and j be as in Lemma 1. Let T be a polar quadtree on
$D_R$ constructed to fit j. Then, for n sufficiently large, height(T) ∈ $O(\log n)$ whp.

A direct consequence of the results above and our previous work [19] is
the preprocessing time for the quadtree construction. The generalized splitting
rule and storing the subtree sizes only change constant factors.

Corollary 1. Since a point insertion takes $O(\log n)$ time whp, constructing a
quadtree on n points distributed as in Section 2.1 takes $O(n \log n)$ time whp.




Fig. 1. Query over 200 points in 
a polar hyperbolic quadtree, with 
/(d) — -F 1) and the 

query point q marked by a red cross. 
Points are colored according to the 
probability that they are included 
in the result. Blue represents a high 
probability, white a probability of 



Fig. 2. Visualization of the data structure used in Figure 1. Quadtree nodes are colored
according to the upper probability bound for points contained in them. The color of a 
quadtree node c is the darkest possible shade (dark = high probability) of any point 
contained in the subtree rooted at c. Each node is marked with the number of points in 
its subtree. 
Algorithm 1: QuadNode.getProbabilisticNeighborhood

Input: query point q, prob. function f, quadtree node c
Output: probabilistic neighborhood of q
 1  N = {};
 2  $\underline{b}$ = dist(q, c);
        /* Distance between point and cell */
 3  $\overline{b}$ = f($\underline{b}$);
        /* Since f is monotonically decreasing, a lower bound for the
           distance gives an upper bound $\overline{b}$ for the probability. */
 4  s = number of points in c;
 5  if c is not leaf then
        /* internal node: descend, add recursive result to local set */
 6      for child ∈ children(c) do
 7          add getProbabilisticNeighborhood(q, f, child) to N;
 8  else
        /* leaf case: apply idea of Batagelj and Brandes [5] */
 9      for i = 0; i < s; i++ do
10          δ = ln(1 - rand) / ln(1 - $\overline{b}$);
11          i += δ;
12          if i ≥ s then
13              break;
14          prob = f(dist(q, c.points[i])) / $\overline{b}$;
15          add c.points[i] to N with probability prob
16  return N


3.2 Algorithm 

The baseline version of our query (Algorithm 1) unfortunately has a time
complexity of $\Theta(n)$, but serves as a foundation for the fast version (Section 4). It
takes as input a query point q, a function f and a quadtree cell c. Initially, it is
called with the root node of the quadtree and recursively descends the tree. The
algorithm returns a point set N(q, f) ⊆ P with

    $\Pr[p \in N(q, f)] = f(\mathrm{dist}(q, p)).$    (1)

Algorithm 1 descends the quadtree recursively until it reaches the leaves. Once
a leaf l is reached, a lower bound $\underline{b}$ for the distance between the query point q and
all the points in l is computed (Line 2). Such distance calculations are detailed
in Appendix B.6. Since f is monotonically decreasing, this lower bound for the
distance gives an upper bound $\overline{b}$ for the probability that a given point in l is a
member of the returned point set (Line 3). This bound is used to select neighbor
candidates in a similar manner as Batagelj and Brandes [5]: In Line 10 a random
number of vertices is skipped, so that every vertex in l is selected as a neighbor
candidate with probability $\overline{b}$. The actual distance dist(q, a) between a candidate
a and the query point q is at least $\underline{b}$ and the probability of a ∈ N(q, f) thus at
most $\overline{b}$. For each candidate, this actual distance dist(q, a) is then calculated and
a neighbor candidate is confirmed as a neighbor with probability $f(\mathrm{dist}(q, a))/\overline{b}$
in Line 14.

Regarding correctness and time complexity of Algorithm 1, we can state:

Proposition 2. Let T be a quadtree as defined above, q be a query point and
$f : \mathbb{R}^+ \to [0,1]$ a monotonically decreasing function which maps distances to
probabilities. The probability that a point p is returned by a PNQ (q, f) from
Algorithm 1 is f(dist(q, p)), independently from whether other points are returned.

Proposition 3. Let T be a quadtree with n points. The running time of
Algorithm 1 per query on T is $\Theta(n)$ in expectation.

The proofs can be found in Appendices B.4 and B.5.
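
The two-stage sampling in a leaf (select candidates with the upper bound, then confirm each with the ratio of actual probability to upper bound) can be sketched for a single leaf as follows. This is an illustrative sketch in the Euclidean plane, not the NetworKit implementation; the name `sample_leaf` and the parameter `dist_lower_bound`, which stands in for the point-to-cell distance, are assumptions.

```python
import math
import random

def sample_leaf(points, q, f, dist_lower_bound, rng=random):
    """Sample the probabilistic neighborhood of q among the points of one leaf."""
    b_up = f(dist_lower_bound)  # upper bound on any inclusion probability
    if b_up <= 0.0:
        return []
    log_q = math.log(1.0 - b_up) if b_up < 1.0 else None
    neighbors = []
    i = -1
    while True:
        # Skip ahead so that each point becomes a candidate with probability b_up.
        jump = 0 if log_q is None else int(math.log(1.0 - rng.random()) / log_q)
        i += 1 + jump
        if i >= len(points):
            return neighbors
        # Confirm the candidate with probability f(dist) / b_up.
        if rng.random() < f(math.dist(q, points[i])) / b_up:
            neighbors.append(points[i])
```

A point survives both stages with probability b_up · f(dist)/b_up = f(dist), which is the property stated in Proposition 2.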


4 Queries in Sublinear Time by Subtree Aggregation 


One reason for the linear time complexity of the baseline query is the fact that 
every quadtree node is visited. To reach a sublinear time complexity, we thus 
aggregate subtrees into virtual leaf cells whenever doing so reduces the number 
of examined cells and does not increase the number of candidates too much. 

To this end, let S be a subtree starting at depth l of a quadtree T. During
the execution of Algorithm 1, a lower bound $\underline{b}$ for the distance between S and
the query point q is calculated, yielding also an upper bound $\overline{b}$ for the neighbor
probability of each point in S. At this step, it is possible to treat S as a virtual
leaf cell, sample jumping widths using $\overline{b}$ as upper bound and use these widths
to select candidates within S. Aggregating a subtree to a virtual leaf cell allows
skipping leaf cells which do not contain candidates, but uses a weaker bound $\overline{b}$
and thus a potentially larger candidate set. Thus, a fast algorithm requires an
aggregation criterion which keeps both the number of candidates and the number
of examined quadtree cells low.

As stated before, we record the number of points in each subtree during
quadtree construction. This information is now used for the query algorithm:
We aggregate a subtree S to a virtual leaf cell exactly if |S|, the number of
points contained in S, is below 1/f(dist(S, q)). This corresponds to less than one
expected candidate within S. The changes required in Algorithm 1 to use the
subtree aggregation are minor. Lines 5, 14 and 15 are changed to:

     5  if c is inner node and $|c| \cdot \overline{b} \geq 1$ then
    14  neighbor = maybeGetKthElement(q, f, i, $\overline{b}$, c);
    15  add neighbor to N if not null

The main change consists in the use of the function maybeGetKthElement
(listed in the appendix). Given a subtree S, an index k, q, f, and $\overline{b}$, this
function descends S to the leaf cell containing the kth element. This element $p_k$
is then accepted with probability $f(\mathrm{dist}(q, p_k))/\overline{b}$.
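
Descending to the k-th element relies only on the subtree sizes stored at each node. The sketch below shows one way this could look; the `Node` class, the helper `kth_element` and the Euclidean distance are assumptions for illustration and do not reproduce the authors' listing.

```python
import math
import random

class Node:
    """Hypothetical quadtree node: leaves hold points, inner nodes hold children."""
    def __init__(self, points=None, children=None):
        self.points = points or []
        self.children = children or []
        self.size = len(self.points) + sum(c.size for c in self.children)

def kth_element(node, k):
    """Return the k-th point (0-based) of the subtree, using stored subtree sizes."""
    while node.children:
        for child in node.children:
            if k < child.size:
                node = child
                break
            k -= child.size
        else:
            raise IndexError("k exceeds the number of points in the subtree")
    return node.points[k]

def maybe_get_kth_element(node, k, q, f, b_up, rng=random):
    """Accept the k-th point with probability f(dist(q, p)) / b_up, else return None."""
    p = kth_element(node, k)
    return p if rng.random() < f(math.dist(q, p)) / b_up else None
```

Each descent visits one node per level, so its cost is bounded by the height of the aggregated subtree.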











Since the upper bound calculated at the root of the aggregated subtree is not
smaller than the individual upper bounds at the original leaf cells, Proposition 2
also holds for the virtual leaf cells. This establishes the correctness.

The time complexity is given by the following theorem, whose proof can be
found in Appendix D.

Theorem 1. Let T be a quadtree with n points and (q, f) a query pair. A query
(q, f) using subtree aggregation has time complexity $O((|N(q, f)| + \sqrt{n}) \log n)$ whp.


5 Application Case Studies 

In order to test our algorithm for PNQs, we apply it in two application case studies, 
one for Euclidean, the other one for hyperbolic geometry. For the Euclidean 
case study we build a simple disease spread simulation as an example for a 
probabilistic spreading process. The probability distribution of points is in this 
case non-uniform and unknown. The hyperbolic application, in turn, is a generator 
for complex networks with a known point distribution. 


5.1 Probabilistic Spreading 

When both contact graph and travel patterns of a susceptible population are not 
known in detail, the resulting spreading behavior of an infectious disease seems 
probabilistic. Contagious diseases usually spread to people in the vicinity of 
infected persons, but an infectious person occasionally bridges larger distances by 
travel and spreads the disease this way. We model this effect with our probabilistic 
neighborhood function f, giving a higher probability for small distances and a
lower but non-zero probability for larger distances. Note that this scenario is 
meant as an example of the probabilistic spreading simulations possible with our 
algorithm and not as highly realistic from an epidemiological point of view. 

In the simulation, the population is given as a set P of points in the Euclidean 
plane. In the initial step, exactly one point (= person) from P is marked as 
infected. Then, in each round, a PNQ is performed for each infected person q. All 
points in N(q, f) become infected in the next round. We use an SIR model [5],
i. e. previously infected persons recover with a certain probability in each round 
and stay infectious otherwise. In our simulation, persons recover with a rate of 
0.8 and are then immune. 
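
The round structure of this simulation can be sketched as follows. For brevity the sketch uses pairwise distance probing in place of the quadtree query, and it assumes that immune and currently infected persons cannot be infected again; the function name `simulate` and the returned history list are illustrative assumptions.

```python
import math
import random

def simulate(points, f, start, rounds, recover_prob=0.8, rng=random):
    """Run a simple SIR-style spreading process; returns (infected, recovered, history)."""
    infected, recovered = {start}, set()
    history = [len(infected)]
    for _ in range(rounds):
        newly = set()
        for qi in infected:
            q = points[qi]
            # One probabilistic neighborhood query per infected person (naive probing).
            for pi, p in enumerate(points):
                if pi in infected or pi in recovered:
                    continue
                if rng.random() < f(math.dist(p, q)):
                    newly.add(pi)
        # Previously infected persons recover with probability recover_prob.
        healed = {i for i in infected if rng.random() < recover_prob}
        recovered |= healed
        infected = (infected - healed) | newly
        history.append(len(infected))
    return infected, recovered, history
```

Replacing the inner loop with an index-based PNQ changes only the running time, not the distribution of outcomes.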


5.2 Random Hyperbolic Graph Generation 


Random hyperbolic graphs (RHGs, also see Section 2.2) are a generative graph
model for complex networks. For graph generation one places n points (= vertices)
randomly in a hyperbolic disk. The radius R of the disk can be used to control
the average degree of the network. A pair of vertices is connected by an edge with
a probability that depends on the vertices’ hyperbolic distance. This connection
probability is given in [14, Eq. (41)] and parametrized by a temperature T ≥ 0:

    $f(x) = \frac{1}{e^{(1/T) \cdot (x - R)/2} + 1}$

This definition of random hyperbolic graphs is a generalized version of the one
considered in our previous work, which was restricted to the special case of T = 0.

Country    5000 PDP queries    Construction QT    5000 QT queries
France     1007 seconds        1.6 seconds        1.2 seconds
Germany    1395 seconds        2.8 seconds        1.3 seconds
USA        4804 seconds        8.7 seconds        0.7 seconds

Table 1. Running time results for polar Euclidean quadtrees on population data. The
query points were selected uniformly at random from P, the probabilistic neighborhood
function is f(x) := (1/x) · e^[…]/n.
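
The connection probability of the random hyperbolic graph model, f(x) = 1/(e^((x−R)/(2T)) + 1), can be evaluated with a short helper. This is a sketch for T > 0; the function name and the overflow guard are assumptions added for illustration.

```python
import math

def connection_probability(x, R, T):
    """Edge probability for hyperbolic distance x, disk radius R, temperature T > 0."""
    z = (x - R) / (2.0 * T)
    if z > 700.0:
        # exp(z) would overflow a double; the probability is numerically zero here.
        return 0.0
    return 1.0 / (math.exp(z) + 1.0)
```

At x = R the probability is exactly 1/2, and as T approaches 0 the function approaches the step function of the threshold model (probability 1 below R, 0 above).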

5.3 Experimental Settings and Results 

Our implementation uses the NetworKit toolkit [18] and is written in C++11.
It is included in NetworKit release 4.1. Running time measurements were made
with g++ 4.8 -O3 on a machine with 128 GB RAM and an Intel Xeon E5-1630
v3 CPU with four cores at 3.7 GHz base frequency. Our code is sequential, as is
the reference implementation for random hyperbolic graph generation [2].

Disease Spread Simulation. We experimented on three data sets taken from
NASA population density raster data [7] for Germany, France and the USA.
They consist of rectangles with small square cells (geographic areas) where for
each cell the population from the year 2000 is given. To obtain a set of points,
we randomly distribute points in each cell to fit 1/20th of the population density.
The left panel of the example figure in the appendix shows an instance with roughly
4 million points on the map of Germany. The data sets of France and USA have
roughly 3 and 14 million points, respectively.

The number of required queries naturally depends heavily on the simulated
disease. For our parameters, a number of 5000 queries is typically reached within
the first dozen steps. To evaluate the algorithmic speedup, Table 1 compares
running times for 5000 pairwise distance probing (PDP) queries against 5000
fast PNQs on the three country datasets. To obtain a similar total number of
infections, we use a slightly different probabilistic neighborhood function for each
country and divide by the population: f(x) := (1/x) · e^[…]/n. This results in a
slower initial progression for the US. Our algorithm achieves a speedup factor of
at least two orders of magnitude, even including the quadtree construction time.

Random Hyperbolic Graph Generation. An example graph generated from hyperbolic
geometry can be seen in the right panel of the example figure in the appendix. We
compare our generator using PNQs with the only (to our knowledge) previously
existing generator for general random hyperbolic graphs [2], i.e. those not only
following the threshold model. As seen in Figure 3, our implementation is faster by
at least one order of magnitude and the experimental running times support our
theoretical time complexity of $O((n^{3/2} + m) \log n)$. A comparison of the generated
graphs with those created by the existing implementation can be found in Appendix G.
The differences measured by a set of suitable network analysis metrics are within
the range of random fluctuations for the sample size of 80.

Fig. 3. Comparison of running times to generate networks with a varying number of
vertices, α = 1, T = 0.5 and average degree k = 6. The gap between the running times
widens, which in the loglog-plot implies a different exponent in the time complexities.
Running times are fitted with a = 2.089 · 10^[…], b = 3.311 · 10^[…], c = 2.18 · 10^[…]
and d = 5.6 · 10^[…].


6 Conclusions 

After formally defining the notion of probabilistic neighborhoods, we have 
presented a quadtree-based query algorithm for such neighborhoods in the 
Euclidean and hyperbolic plane. Our analysis shows a time complexity of
$O((|N(q, f)| + \sqrt{n}) \log n)$. To the best of our knowledge, our algorithm is the
first to solve the problem asymptotically faster than pairwise distance probing.
With two example applications we have shown that our algorithm is also faster 
in practice by at least one order of magnitude. 
A Related Algorithmic Idea

Our approach was inspired by the following algorithm with optimal linear running
time for Erdős-Rényi graph generation [5].

Algorithm 2: Efficient neighborhood generation for Erdős-Rényi graphs
Input: number of vertices n, edge probability 0 < p < 1
Output: G = ({0, ..., n - 1}, E) ∈ G(n, p)

E = ∅;
v = 1;
w = -1;
while v < n do
    draw r ∈ [0,1) uniformly at random;
    w = w + 1 + ⌊log(1 - r) / log(1 - p)⌋;
    while w ≥ v and v < n do
        w = w - v;
        v = v + 1;
    if v < n then
        add {v, w} to E
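
A direct Python transcription of this skipping scheme is given below as a sketch; the function name `gnp_edges` and the edge-list return type are choices made for this example. It assumes 0 < p < 1, as in the algorithm's input specification.

```python
import math
import random

def gnp_edges(n, p, rng=random):
    """Generate the edge list of a G(n, p) graph on vertices 0..n-1, with 0 < p < 1."""
    edges = []
    log_q = math.log(1.0 - p)
    v, w = 1, -1
    while v < n:
        r = rng.random()
        # Jump over a geometrically distributed number of non-edges.
        w += 1 + int(math.log(1.0 - r) / log_q)
        # Carry the position over into the rows of the following vertices.
        while w >= v and v < n:
            w -= v
            v += 1
        if v < n:
            edges.append((v, w))
    return edges
```

The candidate pairs (v, w) with w < v are enumerated row by row, so every unordered pair is considered exactly once and the running time is proportional to n plus the number of generated edges.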


B Proofs of Section 3

B.1 Proof of Lemma 1

Proof. Due to the similarity of Lemma 1 to [19, Lemma 1], the proof follows
a similar structure. Let C be a quadtree cell at level k, delimited by $\min_r$,
$\max_r$, $\min_\phi$ and $\max_\phi$. As stated in Section 2.1, we require the point probability
distribution to be rotationally invariant. The probability that a point p is in C is
then given by

    $\Pr(p \in C) = \frac{\max_\phi - \min_\phi}{2\pi} \cdot (J(\max_r) - J(\min_r)).$    (3)

The boundaries of the children of C are given by the splitting rules in Section 3.1:

    $\mathrm{mid}_\phi := \frac{\max_\phi + \min_\phi}{2}$    (4)

    $\mathrm{mid}_r := (J|_{[0,R]})^{-1}\left(\frac{J(\max_r) + J(\min_r)}{2}\right)$    (5)

We proceed with induction over the depth i of C. Start of induction (i = 0):
At depth 0, only the root cell exists and covers the whole disk. Since $C = D_R$,
$\Pr(p \in C) = 1 = 4^{-0}$.

Inductive step (i → i + 1): Let $C_i$ be a node at depth i. $C_i$ is delimited by
the radial boundaries $\min_r$ and $\max_r$, as well as the angular boundaries $\min_\phi$
and $\max_\phi$. It has four children at depth i + 1, separated by $\mathrm{mid}_r$ and $\mathrm{mid}_\phi$. Let
SW be the south west child of $C_i$. With Eq. (3), the probability of p ∈ SW is:

    $\Pr(p \in SW) = \frac{\mathrm{mid}_\phi - \min_\phi}{2\pi} \cdot (J(\mathrm{mid}_r) - J(\min_r))$    (6)

Using Equations (4) and (5), this results in a probability of

    $\Pr(p \in SW) = \frac{\frac{\max_\phi + \min_\phi}{2} - \min_\phi}{2\pi} \cdot \left( J\left( (J|_{[0,R]})^{-1}\left( \frac{J(\max_r) + J(\min_r)}{2} \right) \right) - J(\min_r) \right)$    (7)

    $= \frac{\frac{\max_\phi + \min_\phi}{2} - \min_\phi}{2\pi} \cdot \left( \frac{J(\max_r) + J(\min_r)}{2} - J(\min_r) \right)$    (8)

    $= \frac{\frac{\max_\phi - \min_\phi}{2}}{2\pi} \cdot \frac{J(\max_r) - J(\min_r)}{2}$    (9)

    $= \frac{1}{4} \cdot \frac{\max_\phi - \min_\phi}{2\pi} \cdot (J(\max_r) - J(\min_r))$    (10)

    $= \frac{1}{4} \Pr(p \in C_i)$    (11)

As per the induction hypothesis, $\Pr(p \in C_i)$ is $4^{-i}$ and $\Pr(p \in SW)$ is thus
$\frac{1}{4} \cdot 4^{-i} = 4^{-(i+1)}$. Due to symmetry when selecting $\mathrm{mid}_\phi$, the same holds for the
south east child of $C_i$. Together, they contain half of the probability mass of $C_i$.
Again due to symmetry, the same proof then holds for the northern children as
well. □


B.2 Proof of Lemma 2

Proof. A quadtree T containing n points can have at most n non-empty leaf cells. 
We can thus bound the total number of leaf cells in T by limiting the number of 
empty cells. 

An empty leaf cell occurs when a previous leaf cell c is split. We consider two 
cases, depending on how many of the children of c contain points: 

Case 1: All but one of the children of c are empty and all points in c are 
concentrated in one child. We call a split of this kind an excess split, since it did 
not result in dividing the points in c. 

Case 2: At least two children of c contain points. 

The number of excess splits caused by a pair of points depends on the area 
they are clustered in. Two sufficiently close points could cause a potentially 
unbounded number of excess splits. However, due to Lemma each child cell 
contains a quarter of the probability mass of its parent cell. Given two points p, q 
in a cell which is split, they end up in different child cells with probability 3/4. 


The expected number of excess splits for a point $p$ is thus at most

$$\sum_{i=0}^{\infty} i \cdot 4^{-i} = \frac{4}{9}. \tag{12}$$

Due to the linearity of expectations, the expected number of excess splits 
caused by n points is then at most 4n/9. Each excess split causes four additional 
quadtree nodes, three of them are empty leaf cells. 

If we remove all quadtree nodes caused by excess splits and reconnect the tree 
by connecting the remaining leaves to their lowest unremoved ancestor, every 
inner node in the remaining tree T' has at least two non-empty subtrees. Since a 
binary tree with $n$ leaves has $O(n)$ inner nodes [17] and the branching factor in
$T'$ is at least two, $T'$ also contains at most $O(n)$ inner nodes.

Together with the expected $O(n)$ nodes caused by excess splits, this results
in $O(n)$ nodes in $T$ in expectation. □

B.3 Proof of Proposition 

Proof. We proved a similar lemma in previous work [19], for hyperbolic geometry
only and a restricted family of probability distributions. The requirement for
that proof was that a given point $p$ has a probability of $4^{-i}$ to land in a given
cell at depth $i$. As shown in Appendix B.1, this requirement is fulfilled for the
quadtrees used in this paper in both Euclidean and hyperbolic geometry. We
can thus reuse the proof of [19, Lemma 2], which we include for the purpose of
self-containment:


Proof of [19, Lemma 2]

Proof. In a complete quadtree, $4^i$ cells exist at depth $i$. For analysis purposes
only, we construct such a complete but initially empty quadtree of height $k = 3 \cdot \lceil \log_4(n) \rceil$,
which has at least $n^3$ leaf cells. As seen in Appendix B.1, a given point
has an equal chance to land in each leaf cell. Hence, we can apply [19, Lemma 6]
with each leaf cell being a bin and a point being a ball. (The fact that we can
have more than $n^3$ leaf cells only helps in reducing the average load.) From this
we can conclude that, for $n$ sufficiently large, no leaf cell of the current tree
contains more than 1 point with high probability (whp). Consequently, the total
quadtree height does not exceed $k = 3 \cdot \lceil \log_4(n) \rceil \in O(\log n)$ whp.

Let $T'$ be the quadtree as constructed in the previous paragraph, starting
with a complete quadtree of height $k$ and splitting leaves when their capacity is
exceeded. Let $T$ be the quadtree created in our algorithm, starting with a root
node, inserting points and also splitting leaves when necessary, growing the tree
downward.

Since both trees grow downward as necessary to accommodate all points, but
$T$ does not start with a complete quadtree of height $k$, the set of quadtree nodes
in $T$ is a subset of the quadtree nodes in $T'$. Consequently, the height of $T$ is
bounded by $O(\log n)$ whp as well. □

Footnote (to the bound on the expected number of excess splits in B.2): Note that the real number of excess splits might be lower, since a split might separate
another point from $p$ and $q$.


B.4 Proof of Proposition 

Proof. Note that the hyperbolic [Euclidean] distances, which are mapped to probabilities
according to the function $f$, are calculated by Algorithm 3 [Algorithm 4],
which are presented in Appendix B.6 (together with their correctness proofs).
We continue the current proof with details for all three main steps.


Step 1: Between two points, the jumping width $\delta$ is given by Line 10. The
probability that exactly $i$ points are skipped between two given candidates is
$(1 - b)^i \cdot b$:

\begin{align}
\Pr(i \le \delta < i + 1) &= \tag{13}\\
\Pr(i \le \ln(1 - r)/\ln(1 - b) < i + 1) &= \tag{14}\\
\Pr(\ln(1 - r) \le i \cdot \ln(1 - b) \wedge \ln(1 - r) > (i + 1) \cdot \ln(1 - b)) &= \tag{15}\\
\Pr(1 - (1 - b)^i \le r < 1 - (1 - b)^{i+1}) &= \tag{16}\\
1 - (1 - b)^{i+1} - 1 + (1 - b)^i &= \tag{17}\\
(1 - b)^i (1 - (1 - b)) &= \tag{18}\\
(1 - b)^i \cdot b \tag{19}
\end{align}

Note that in Eq. (14) the denominator is negative, thus the direction of the
inequality is reversed in the transformation. The transformation from Eq. (16)
to Eq. (17) works since $r$ is uniformly distributed.

Following from Eq. (19), the probability is $b$ for $i = 0$, and if a point is selected
as a candidate, the subsequent point is selected with a probability of $b$.
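
The skip distribution derived in Step 1 can be checked empirically. The following is a minimal sketch, assuming the jump width is computed as the floor of ln(1 - r)/ln(1 - b) with r drawn uniformly from [0, 1), as Eq. (14) suggests. The function names are illustrative and not taken from the paper's implementation.

```python
import math
import random
from typing import Dict


def jump_width(b: float, rng: random.Random) -> int:
    """Number of points skipped before the next candidate (assumed form of the jump)."""
    r = rng.random()  # uniform in [0, 1)
    # Both logarithms are <= 0, so the quotient is >= 0; int() then acts as floor.
    return int(math.log(1.0 - r) / math.log(1.0 - b))


def empirical_skip_distribution(b: float, samples: int, seed: int = 0) -> Dict[int, float]:
    """Relative frequency of each jump width over `samples` independent draws."""
    rng = random.Random(seed)
    counts: Dict[int, int] = {}
    for _ in range(samples):
        d = jump_width(b, rng)
        counts[d] = counts.get(d, 0) + 1
    return {d: c / samples for d, c in counts.items()}
```

For an upper bound such as b = 0.3, the observed frequency of a jump of width i should be close to (1 - b)^i * b, in line with Eq. (19).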


Step 2: Let $p_i$, $p_j$ and $p_l$ be points in a leaf, with $i < j < l$, and let $p_i$ be a
neighbor candidate. For now we assume that no other points in the same leaf
are candidates and consider the probability that $p_l$ is selected as a candidate
depending on whether the intermediate point $p_j$ is a candidate.

Case 2.1: If point $p_j$ is a candidate, then point $p_l$ is selected if $l - j$ points
are skipped after selecting $p_j$. Due to Step 1, this probability is $(1 - b)^{l-j} \cdot b$.

Case 2.2: If point $p_j$ is not a candidate, then point $p_l$ is selected if $l - i$
points are skipped after selecting $p_i$. Given that $p_j$ is not selected, at least $j - i$
points are skipped. The conditional probability is then:

\begin{align}
\Pr(l - i \le \delta < l - i + 1 \mid \delta \ge j - i) &= \tag{20}\\
\Pr(1 - (1 - b)^{l-i} \le r < 1 - (1 - b)^{l-i+1} \mid \delta \ge j - i) &= \tag{21}\\
(1 - b)^{l-i} \cdot b / (1 - b)^{j-i} &= \tag{22}\\
(1 - b)^{l-j} \cdot b \tag{23}
\end{align}

As both cases yield the same result, the probability $\Pr(p_l \in \mathrm{Candidates})$ is
independent of whether $p_j$ is a candidate.

Step 3: Let $C$ be a leaf cell in which all points up to point $p_i$ are selected as
candidates. Due to Step 1, the probability that $p_{i+1}$ is also a candidate, meaning
no points are skipped, is $(1 - b)^0 \cdot b = b$. Due to Step 2, the probability of $p_{i+1}$
being a candidate is independent of whether $p_i$ is a candidate. This can be applied
iteratively until the beginning of the leaf cell, yielding a probability of $b$ for $p_i$
being a candidate, independent of whether other points are selected.

A neighbor candidate $p_i$ is accepted as a neighbor with probability $f(\mathrm{dist}(p_i, q))/b$
in Line 14. Since $b$ is an upper bound for the neighborhood probability, the
acceptance ratio is between 0 and 1. The probability for a point $p$ to be in the
probabilistic neighborhood computed by the query algorithm is thus:

\begin{align}
\Pr(p \in N(q, f)) &= \tag{24}\\
\Pr(p \in N(q, f) \wedge p \in \mathrm{Candidates}(q, f)) &= \tag{25}\\
\Pr(p \in N(q, f) \mid p \in \mathrm{Candidates}(q, f)) \cdot \Pr(p \in \mathrm{Candidates}(q, f)) &= \tag{26}\\
f(\mathrm{dist}(p, q))/b \cdot b &= \tag{27}\\
f(\mathrm{dist}(p, q)) \tag{28}
\end{align}

□
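
The three steps can be combined into a small simulation of a single leaf. This is a sketch under stated assumptions, not the paper's implementation: the function `sample_leaf`, its parameters, and the precomputed distance list are hypothetical, and the jump width is again taken as the floor of ln(1 - r)/ln(1 - b).

```python
import math
import random
from typing import Callable, List, Sequence


def sample_leaf(
    dists: Sequence[float],
    f: Callable[[float], float],
    b: float,
    rng: random.Random,
) -> List[int]:
    """Return indices of the leaf points accepted as neighbors.

    dists[i] is the distance of the i-th leaf point to the query point,
    b is an upper bound for f on this leaf (0 <= b <= 1).
    """
    if b <= 0.0:
        return []  # no point in this leaf can be a neighbor
    candidates: List[int] = []
    if b >= 1.0:
        candidates = list(range(len(dists)))  # every point is a candidate
    else:
        i = -1
        while True:
            # Geometric jump: skip delta points, then select the next one.
            delta = int(math.log(1.0 - rng.random()) / math.log(1.0 - b))
            i += delta + 1
            if i >= len(dists):
                break
            candidates.append(i)
    # Accept each candidate with probability f(dist) / b, cf. Eq. (27).
    return [i for i in candidates if rng.random() < f(dists[i]) / b]
```

Repeating the call many times and counting how often each index is returned should reproduce f(dist) for every point, which is the statement of Eq. (28).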


B.5 Proof of Proposition 

Proof. The total time complexity of the query algorithm is determined by the
number of recursive calls and the number of loop iterations.
During tree traversal, one recursive call is made for each examined quadtree node.
During examination of a leaf, one loop iteration happens for every examined
candidate. Let the set of neighbors ($N(q, f)$), candidates ($\mathrm{Candidates}(q, f)$) and
examined cells ($\mathrm{Cells}(q, f)$) be as defined in Section 2.1. The time complexity of
the query is then in $O(|\mathrm{Candidates}(q, f)| + |\mathrm{Cells}(q, f)|)$.

All cells of the quadtree are examined, thus $\mathrm{Cells}(q, f) = \mathrm{Cells}(T)$. If the cells
are split using the medians of point positions, then no leaf cell is empty and the
tree contains at most $n$ cells. If cells are split using the theoretical probability
distributions, the tree contains at most $O(n)$ cells in expectation (see Appendix B.2).
It follows that the number of examined cells is in $O(n)$ in expectation. Since the
candidate set is a subset of the point set, the expected number of candidates is
at most $n$. The query time complexity is then in $O(n) + O(|\mathrm{Cells}(T)|) = O(n)$ in
expectation. □


B.6 Distance between Quadtree Cell and Point 

To calculate the upper bound $b$ used in the query algorithm, we need a lower bound for
the distance between the query point $q$ and any point in a given quadtree cell.
Since the quadtree cells are polar, the distance calculations might be unfamiliar
and we show and prove them explicitly. For the hyperbolic case, the distance
calculations are shown in Algorithm 3 and proven in Lemma 3. The Euclidean
calculations are shown in Algorithm 4 and proven in Lemma 4.

Algorithm 3: Infimum and supremum of distance in a hyperbolic polar
quadtree

Input: quadtree cell C = (min_r, max_r, min_φ, max_φ), query point q = (φ_q, r_q)
Output: infimum and supremum of hyperbolic distances from q to the interior of C
    /* start with corners of cell as possible extrema */
 1  cornerSet = {(min_φ, min_r), (min_φ, max_r), (max_φ, min_r), (max_φ, max_r)};
 2  a = cosh(r_q);
 3  b = sinh(r_q) · cos(φ_q − min_φ);
    /* Left/Right boundaries */
 4  leftExtremum = ½ ln((a + b)/(a − b));
 5  if min_r < leftExtremum < max_r then
 6      add (min_φ, leftExtremum) to cornerSet;
 7  b = sinh(r_q) · cos(φ_q − max_φ);
 8  rightExtremum = ½ ln((a + b)/(a − b));
 9  if min_r < rightExtremum < max_r then
10      add (max_φ, rightExtremum) to cornerSet;
    /* Top/bottom boundaries */
11  if min_φ < φ_q < max_φ then
12      add (φ_q, min_r) and (φ_q, max_r) to cornerSet;
13  φ_mirrored = φ_q + π mod 2π;
14  if min_φ < φ_mirrored < max_φ then
15      add (φ_mirrored, min_r) and (φ_mirrored, max_r) to cornerSet;
    /* If point is in cell, distance is zero: */
16  if min_φ < φ_q < max_φ AND min_r < r_q < max_r then
17      infimum = 0;
18  else
19      infimum = min_{e ∈ cornerSet} dist_H(q, e);
20  supremum = max_{e ∈ cornerSet} dist_H(q, e);
21  return infimum, supremum;


Lemma 3. Let $C$ be a quadtree cell and $q$ a point in hyperbolic space. The first
value returned by Algorithm 3 is the distance of $C$ to $q$.

Proof. When $q$ is in $C$, the distance is trivially zero. Otherwise, the distance
between $q$ and $C$ can be reduced to the distance between $q$ and the boundary of
$C$, $\delta C$:

$$\mathrm{dist}_{\mathbb{H}}(C, q) = \mathrm{dist}_{\mathbb{H}}(\delta C, q) = \inf_{p \in \delta C} \mathrm{dist}_{\mathbb{H}}(p, q) \tag{29}$$

Since the boundary is closed, this infimum is actually a minimum:

$$\mathrm{dist}_{\mathbb{H}}(C, q) = \inf_{p \in \delta C} \mathrm{dist}_{\mathbb{H}}(p, q) = \min_{p \in \delta C} \mathrm{dist}_{\mathbb{H}}(p, q) \tag{30}$$

Algorithm 4: Infimum and supremum of distance in a Euclidean polar
quadtree

Input: quadtree cell C = (min_r, max_r, min_φ, max_φ), query point q = (φ_q, r_q)
Output: infimum and supremum of Euclidean distances from q to the interior of C
    /* start with corners of cell as possible extrema */
 1  cornerSet = {(min_φ, min_r), (min_φ, max_r), (max_φ, min_r), (max_φ, max_r)};
    /* Left/Right boundaries */
 2  leftExtremum = r_q · cos(min_φ − φ_q);
 3  if min_r < leftExtremum < max_r then
 4      add (min_φ, leftExtremum) to cornerSet;
 5  rightExtremum = r_q · cos(max_φ − φ_q);
 6  if min_r < rightExtremum < max_r then
 7      add (max_φ, rightExtremum) to cornerSet;
    /* Top/bottom boundaries */
 8  if min_φ < φ_q < max_φ then
 9      add (φ_q, min_r) and (φ_q, max_r) to cornerSet;
10  φ_mirrored = φ_q + π mod 2π;
11  if min_φ < φ_mirrored < max_φ then
12      add (φ_mirrored, min_r) and (φ_mirrored, max_r) to cornerSet;
    /* If point is in cell, distance is zero: */
13  if min_φ < φ_q < max_φ AND min_r < r_q < max_r then
14      infimum = 0;
15  else
16      infimum = min_{e ∈ cornerSet} dist_E(q, e);
17  supremum = max_{e ∈ cornerSet} dist_E(q, e);
18  return infimum, supremum;


The boundary of a quadtree cell consists of four closed curves:

- left: $\{(\min_\phi, r) \mid \min_r \le r \le \max_r\}$

- right: $\{(\max_\phi, r) \mid \min_r \le r \le \max_r\}$

- lower: $\{(\phi, \min_r) \mid \min_\phi \le \phi \le \max_\phi\}$

- upper: $\{(\phi, \max_r) \mid \min_\phi \le \phi \le \max_\phi\}$

We write the distance to the whole boundary as a minimum over the distances
to its parts:

$$\mathrm{dist}_{\mathbb{H}}(\delta C, q) = \min_{A \in \{\text{left}, \text{right}, \text{lower}, \text{upper}\}} \mathrm{dist}_{\mathbb{H}}(A, q) \tag{31}$$

All points on an angular boundary curve $A$ have the same angular coordinate
$\phi_A$. Let $d_A(r) = \operatorname{acosh}(\cosh(r) \cosh(r_q) - \sinh(r) \sinh(r_q) \cos(\phi_q - \phi_A))$ for a
fixed point $q$. The distance $\mathrm{dist}_{\mathbb{H}}(A, q)$ can then be reduced to:

$$\mathrm{dist}_{\mathbb{H}}(A, q) = \min_{\min_r \le r \le \max_r} d_A(r) \tag{32--33}$$

The minimum of $d_A$ on $A$ is the minimum of $d_A(\min_r)$, $d_A(\max_r)$ and the
value at possible extrema. To find the extrema, we define a function $g(r) =
\cosh(r) \cosh(r_q) - \sinh(r) \sinh(r_q) \cos(\phi_q - \phi_A)$. Since acosh is strictly monotone,
$g(r)$ has the same extrema as $d_A(r)$.

The factors $\cosh(r_q)$ and $\sinh(r_q) \cos(\phi_q - \phi_A)$ do not depend on $r$. To increase
readability, we substitute them with the constants $a$ and $b$:

\begin{align}
a &= \cosh(r_q) \tag{34}\\
b &= \sinh(r_q) \cos(\phi_q - \phi_A) \tag{35}\\
d_A(r) &= \operatorname{acosh}(\cosh(r) \cdot a - \sinh(r) \cdot b) \tag{36}\\
g(r) &= \cosh(r) \cdot a - \sinh(r) \cdot b \tag{37}
\end{align}

The derivative of $g$ is thus:

$$g'(r) = \sinh(r) \cdot a - \cosh(r) \cdot b = \frac{e^r - e^{-r}}{2} \cdot a - \frac{e^r + e^{-r}}{2} \cdot b \tag{38}$$

With some transformations, we get the roots of $g'(r)$:

Case $a = b$:

$$\begin{aligned}
g'(r) = 0 &\Leftrightarrow \frac{e^r - e^{-r}}{2} \cdot a - \frac{e^r + e^{-r}}{2} \cdot a = 0 \\
&\Leftrightarrow e^r - e^{-r} = e^r + e^{-r} \\
&\Leftrightarrow e^{-r} = 0
\end{aligned} \tag{39--44}$$

For $a = b$, $d_A$ has no extrema in $\mathbb{R}$.

Case $a \ne b$:

$$\begin{aligned}
g'(r) = 0 &\Leftrightarrow \frac{e^r - e^{-r}}{2} \cdot a = \frac{e^r + e^{-r}}{2} \cdot b \\
&\Leftrightarrow a e^r - a e^{-r} = b e^r + b e^{-r} \\
&\Leftrightarrow (a - b) e^r = (a + b) e^{-r} \\
&\Leftrightarrow e^{2r} = \frac{a + b}{a - b} \\
&\Leftrightarrow 2r = \ln\left(\frac{a + b}{a - b}\right) \\
&\Leftrightarrow r = \frac{1}{2} \ln\left(\frac{a + b}{a - b}\right)
\end{aligned} \tag{45--53}$$

For $a \ne b$, $d_A$ has a single extremum at $\frac{1}{2} \ln\left(\frac{a + b}{a - b}\right)$. This extremum is calculated
for both angular boundaries in Lines 4 and 8 of Algorithm 3.

If $d_A(r)$ has an extremum $x$ in $A$, the minimum of $d_A(r)$ on $A$ is $\min\{d_A(\min_r),
d_A(\max_r), d_A(x)\}$, otherwise it is $\min\{d_A(\min_r), d_A(\max_r)\}$.
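
The closed form for the extremum on an angular boundary can be sanity-checked numerically. The sketch below is illustrative only: the function names are assumptions, and the check relies on the observation that g''(r) = cosh(r) a - sinh(r) b is positive because a = cosh(r_q) exceeds |b|, so the extremum is a minimum of g.

```python
import math
from typing import Optional


def g(r: float, r_q: float, phi_q: float, phi_a: float) -> float:
    """cosh of the hyperbolic distance between (phi_a, r) and (phi_q, r_q)."""
    return (math.cosh(r) * math.cosh(r_q)
            - math.sinh(r) * math.sinh(r_q) * math.cos(phi_q - phi_a))


def angular_extremum(r_q: float, phi_q: float, phi_a: float) -> Optional[float]:
    """Radius at which g (and hence d_A) is extremal on the ray at angle phi_a."""
    a = math.cosh(r_q)
    b = math.sinh(r_q) * math.cos(phi_q - phi_a)
    if a == b:
        return None  # mirrors the case a = b: no extremum exists
    return 0.5 * math.log((a + b) / (a - b))
```

The returned radius may lie outside [min_r, max_r] (it is negative whenever b is negative), which is why the range check in Algorithm 3 is needed before adding it to cornerSet.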

A similar approach works for the radial boundary curves. Let $B$ be a radial
boundary curve at radius $r_B$ with angular bounds $\min_\phi$ and $\max_\phi$. Let $d_B(\phi)$ be
the distance to $q$ restricted to radius $r_B$:

\begin{align}
d_B &: [0, 2\pi] \to \mathbb{R} \tag{54}\\
d_B(\phi) &= \operatorname{acosh}(\cosh(r_B) \cosh(r_q) - \sinh(r_B) \sinh(r_q) \cos(\phi_q - \phi)) \tag{55}
\end{align}

Similarly to the angular boundaries, we define some constants and a function
$g(\phi)$ with the same extrema as $d_B$:

\begin{align}
a &= \cosh(r_B) \cosh(r_q) \tag{56}\\
b &= \sinh(r_B) \sinh(r_q) \tag{57}\\
g(\phi) &= a - b \cos(\phi_q - \phi) \tag{58}
\end{align}

Case $b = 0$:

\begin{align}
b &= \sinh(r_B) \sinh(r_q) = 0 \tag{59}\\
g(\phi) &= a \tag{60}
\end{align}

Since $g$ is constant, no extrema exist.

Case $b \ne 0$: We obtain the extrema with some transformations:

\begin{align}
g'(\phi) &= -b \sin(\phi_q - \phi) \tag{61}\\
g'(\phi) = 0 &\Leftrightarrow \tag{62}\\
\sin(\phi_q - \phi) = 0 &\Leftrightarrow \tag{63}\\
\phi &= \phi_q \bmod \pi \tag{64}
\end{align}

The distance function $d_B(\phi)$ thus has two extrema.

The minimum of $d_B(\phi)$ on $B$ is then:

$$\min_{\phi \in B} d_B(\phi) = \min\left(\{d_B(\min_\phi), d_B(\max_\phi)\} \cup \{d_B(\phi) \mid \min_\phi \le \phi \le \max_\phi \wedge \phi = \phi_q \bmod \pi\}\right) \tag{65}$$

The distance $\mathrm{dist}_{\mathbb{H}}(C, q)$ can thus be written as the minimum of four to ten
point-to-point distances. Algorithm 3 collects the arguments for these distances
in the variable cornerSet and returns the distance minimum as the first return
value. □

Lemma 4. Let $T$ be a polar quadtree in Euclidean space, $c$ a quadtree cell of $T$
and $q$ a point in Euclidean space. The first value returned by Algorithm 4 is the
distance of $c$ to $q$.

Proof. The general distance equation for polar coordinates in Euclidean space is

$$f(r_p, r_q, \phi_p, \phi_q) = \sqrt{r_p^2 + r_q^2 - 2 r_p r_q \cos(\phi_p - \phi_q)} \tag{66}$$

If the query point $q$ is within $c$, the distance is zero. Otherwise, the distance
between $q$ and $c$ is equal to the distance between $q$ and the boundary of $c$. We
consider each boundary component separately and derive the extrema of the
distance function.


Radial boundary. When considering the radial boundary, everything but one
angle is fixed:

$$f(\phi_p) = \sqrt{r_p^2 + r_q^2 - 2 r_p r_q \cos(\phi_p - \phi_q)} \tag{67}$$

Since the distance is positive and the square root is a monotone function, the
extrema of the previous function are at the same values as the extrema of its
square $g(\phi_p)$:

$$g(\phi_p) = r_p^2 + r_q^2 - 2 r_p r_q \cos(\phi_p - \phi_q) \tag{68}$$

We set the derivative to zero to find the extrema:

\begin{align}
g'(\phi_p) = 0 &\Leftrightarrow \tag{69}\\
2 r_p r_q \sin(\phi_p - \phi_q) = 0 &\Leftrightarrow \tag{70}\\
\phi_p &= \phi_q \bmod \pi \tag{71}
\end{align}


Angular boundary. Similar to the radial boundary, we fix everything but the
radius:

$$f(r_p) = \sqrt{r_p^2 + r_q^2 - 2 r_p r_q \cos(\phi_p - \phi_q)} \tag{72}$$

Again, we define a helper function with the same extrema:

$$g(r_p) = r_p^2 + r_q^2 - 2 r_p r_q \cos(\phi_p - \phi_q) \tag{73}$$

We set the derivative to zero to find the extrema:

\begin{align}
g'(r_p) = 0 &\Leftrightarrow \tag{74}\\
2 r_p - 2 r_q \cos(\phi_p - \phi_q) = 0 &\Leftrightarrow \tag{75}\\
r_p = r_q \cos(\phi_p - \phi_q) &\Rightarrow \tag{76}\\
g(r_p) &= r_p^2 + r_q^2 - 2 r_p^2 \tag{77}\\
&= r_q^2 - r_p^2 \tag{78}\\
&= r_q^2 (1 - \cos^2(\phi_p - \phi_q)) \tag{79}
\end{align}

An extremum of $f$ on the boundary of cell $c$ is either at one of its corners or
at the points derived in Eq. (71) or Eq. (79). If $q \notin c$, the minimum over these
points and the corners, as computed by Algorithm 4, is the minimal distance
between $q$ and any point in $c$. If $q$ is contained in $c$, the distance is trivially
zero. □
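
The Euclidean case can be written out as a short runnable sketch. It follows the structure of Algorithm 4 but is not the authors' code: the function names are hypothetical, angles are assumed to lie in [0, 2π), the cell is assumed not to wrap around the angle 0, and the containment test uses half-open intervals as an assumption.

```python
import math
from typing import List, Tuple

Polar = Tuple[float, float]  # (phi, r)


def dist_polar(p: Polar, q: Polar) -> float:
    """Euclidean distance between two points in polar coordinates, cf. Eq. (66)."""
    (phi_p, r_p), (phi_q, r_q) = p, q
    sq = r_p * r_p + r_q * r_q - 2.0 * r_p * r_q * math.cos(phi_p - phi_q)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def cell_distance_bounds(min_r: float, max_r: float, min_phi: float,
                         max_phi: float, q: Polar) -> Tuple[float, float]:
    """Infimum and supremum of the distance between q and a polar cell."""
    phi_q, r_q = q
    cands: List[Polar] = [(min_phi, min_r), (min_phi, max_r),
                          (max_phi, min_r), (max_phi, max_r)]
    # Angular (left/right) boundaries: extremum at r = r_q * cos(phi_A - phi_q).
    for phi_a in (min_phi, max_phi):
        ext = r_q * math.cos(phi_a - phi_q)
        if min_r < ext < max_r:
            cands.append((phi_a, ext))
    # Radial (lower/upper) boundaries: extrema at phi_q and phi_q + pi.
    for phi in (phi_q, (phi_q + math.pi) % (2.0 * math.pi)):
        if min_phi < phi < max_phi:
            cands.append((phi, min_r))
            cands.append((phi, max_r))
    inside = min_phi <= phi_q < max_phi and min_r <= r_q < max_r
    infimum = 0.0 if inside else min(dist_polar(q, e) for e in cands)
    supremum = max(dist_polar(q, e) for e in cands)
    return infimum, supremum
```

A brute-force scan over a fine grid of points in the cell should agree with both returned values up to the grid resolution, which is a convenient way to test such a routine.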
C Algorithm maybeGetKthElement, used in Section


Algorithm 5: maybeGetKthElement

Input: query point q, function f, index k, bound b, subtree S
Output: kth point of S or empty set

 1  if S.isLeaf() then
 2      acceptance = f(dist(q, S.points[k]))/b;
 3      if 1 − rand() < acceptance then
 4          return S.points[k];
 5      else
 6          return ∅;
 7  else
        /* Recursive call */
 8      offset := 0;
 9      for child ∈ S.children do
            /* |child| is the number of points in child */
10          if k − offset < |child| then
11              return maybeGetKthElement(q, f, k − offset, b, child);
12          offset += |child|;
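
Algorithm 5 translates almost line by line into executable code. The following sketch assumes a hypothetical `Node` class with `points` and `children` fields and passes the distance function and random generator explicitly; none of these names come from the paper. A practical implementation would cache the subtree sizes instead of recomputing them.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    points: List[Point] = field(default_factory=list)     # used by leaves only
    children: List["Node"] = field(default_factory=list)  # empty for leaves

    def is_leaf(self) -> bool:
        return not self.children

    def size(self) -> int:
        """Number of points stored in this subtree (recomputed on every call)."""
        if self.is_leaf():
            return len(self.points)
        return sum(child.size() for child in self.children)


def maybe_get_kth_element(q: Point, f: Callable[[float], float], k: int, b: float,
                          subtree: Node, dist: Callable[[Point, Point], float],
                          rng: random.Random) -> Optional[Point]:
    """Return the k-th point of the subtree with probability f(dist)/b, else None."""
    if subtree.is_leaf():
        acceptance = f(dist(q, subtree.points[k])) / b
        if 1.0 - rng.random() < acceptance:
            return subtree.points[k]
        return None
    offset = 0
    for child in subtree.children:
        n_child = child.size()
        if k - offset < n_child:
            return maybe_get_kth_element(q, f, k - offset, b, child, dist, rng)
        offset += n_child
    return None  # k exceeds the number of points in the subtree
```

Calling it with `dist=math.dist` and a seeded `random.Random` makes runs reproducible.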
D Proof of Theorem 1


Proof. Similar to the baseline algorithm, the complexity of the faster query is
determined by the number of recursive calls and the total number of loop iterations
across the calls. The first corresponds to the number of examined quadtree cells,
the second to the total number of candidates. With subtree aggregation, we obtain
improved bounds: Lemma 7 limits the number of candidates to $O(|N(q, f)| + \sqrt{n})$
whp, while Lemma 8 bounds the number of examined quadtree cells to
$O((|N(q, f)| + \sqrt{n}) \log n)$ whp. Together, this results in a query complexity of
$O((|N(q, f)| + \sqrt{n}) \log n)$ whp. □


For the lemmas required in the proof of Theorem 1, we need to introduce
some notation: Let $T$ be a quadtree with $n$ points, $S$ a subtree of $T$ containing
$s$ points, $q$ a query point and $f$ a function mapping distances to probabilities.
The set of neighbors ($N(q, f)$), candidates ($\mathrm{Candidates}(q, f)$) and examined cells
($\mathrm{Cells}(q, f)$) are defined as in Section 2.1.



For the analysis we divide the space around the query point $q$ into infinitely
many bands, based on the probabilities given by $f$. A point $p \in P$ is in band $i$
exactly if the probability of it being a neighbor of $q$ is between $2^{-(i+1)}$ and $2^{-i}$:

$$p \in \text{band } i \Leftrightarrow 2^{-(i+1)} < f(\mathrm{dist}(p, q)) \le 2^{-i}$$


Based on these bands, we divide the previous sets into infinitely many subsets:

- $P(q, f, i) := \{v \in P \mid 2^{-(i+1)} < f(\mathrm{dist}(v, q)) \le 2^{-i}\}$

- $N(q, f, i) := N(q, f) \cap P(q, f, i)$

- $\mathrm{Candidates}(q, f, i) := \mathrm{Candidates}(q, f) \cap P(q, f, i)$

- $\mathrm{Cells}(q, f, i) := \{c \in \mathrm{Cells}(q, f) \mid 2^{-(i+1)} < f(\mathrm{dist}(c, q)) \le 2^{-i}\}$

Note that for fixed $n$, all but at most finitely many of these sets are empty.
We say that the quadtree cells in $\mathrm{Cells}(q, f, i)$ are anchored in band $i$. The region
covered by a quadtree cell is in general not aligned with the probability bands,
thus a quadtree cell anchored in band $i$ ($c \in \mathrm{Cells}(q, f, i)$) may contain points
from higher bands (i.e. with lower probabilities).
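
The band of a point follows directly from its neighbor probability. A minimal sketch, assuming half-open bands (2^-(i+1), 2^-i] and treating probability zero as band infinity; the function name is hypothetical:

```python
import math
from typing import Union


def band_index(prob: float) -> Union[int, float]:
    """Index i of the band with 2^-(i+1) < prob <= 2^-i; math.inf for prob == 0."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be a probability in [0, 1]")
    if prob == 0.0:
        return math.inf  # assumption: zero probability is placed in band infinity
    return math.floor(-math.log2(prob))
```

Under this convention a probability of exactly 2^-i belongs to band i, so probability 1 lies in band 0.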

We continue with two auxiliary results used in Lemma 7. Lemma 5 helps in
bounding the number of candidates that are in the same band as their (virtual
or original) quadtree cell is anchored in. Lemma 6 is used to bound the number
of points in a higher band than their original quadtree cell.

Lemma 5. Let $n$ be a natural number and let $A$, $B$ be sets with $A \subseteq B$, $|B| \le n$
and the following property: $\Pr(b \in A) \ge 0.5$, $\forall b \in B$. Further, let the probabilities
for membership in $A$ be independent. Then, the number of points in $B$ is in
$O(|A| + \log n)$ with probability at least $1 - 1/n^3$.

Proof. Let $X = |A|$ be a random variable denoting the size of $A$. Since the
individual probabilities for membership in $A$ might be different, $X$ does not
necessarily follow a binomial distribution. We define an auxiliary random variable
$Y \sim B(|B|, 0.5)$. Since all membership probabilities for $A$ are at least 0.5, lower
tail bounds derived for $Y$ also hold for $X$.

The probability that $Y$ is less than $0.1|B|$ is then [3]:

$$\begin{aligned}
\Pr(Y \le 0.1|B|) &\le \exp\left(-2 \cdot \frac{(0.5|B| - 0.1|B|)^2}{|B|}\right) \\
&= \exp\left(-2 \cdot \frac{(0.4|B|)^2}{|B|}\right) \\
&= \exp(-2 \cdot 0.16|B|) \\
&= \exp(-0.32|B|)
\end{aligned} \tag{80--84}$$

We conclude with a case distinction:

If $|B| > 10 \log n$: The probability $\Pr(|A| < 0.1|B|)$ is then $\Pr(|A| < 0.1|B|) \le
\Pr(Y < 0.1|B|) \le \exp(-3.2 \log n) = n^{-3.2} < 1/n^3$. Thus $|B| < 10|A| \in O(|A|)$
with probability at least $1 - 1/n^3$.

If $|B| \le 10 \log n$: $|B|$ is then trivially in $O(\log n)$.

□
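
The tail bound used in this proof can be compared with the exact binomial tail for small set sizes. The sketch below is only an illustration with hypothetical function names; it evaluates Pr(Y <= 0.1 m) for Y ~ B(m, 0.5) exactly and contrasts it with exp(-0.32 m).

```python
import math


def binomial_lower_tail(m: int, p: float, threshold: int) -> float:
    """Exact Pr(Y <= threshold) for Y ~ B(m, p)."""
    return sum(math.comb(m, j) * p ** j * (1.0 - p) ** (m - j)
               for j in range(0, threshold + 1))


def tail_bound(m: int) -> float:
    """Upper bound exp(-0.32 m) on Pr(Y <= 0.1 m) for Y ~ B(m, 0.5)."""
    return math.exp(-0.32 * m)
```

For every m tried, the exact tail stays below the bound, as the derivation requires.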


Lemma 6. Let $T$ be a polar hyperbolic [Euclidean] quadtree with $n$ points and
$s < n$ a natural number. Let $A$ be a circle in the hyperbolic [Euclidean] plane and
let $\mathcal{S}$ be the disjoint set of subtrees of $T$ that contain at most $s$ points and are cut
by $A$. Then, the subtrees in $\mathcal{S}$ contain at most $24\sqrt{n \cdot s}$ points with probability at
least $1 - 0.7^{\sqrt{n}}$ for $n$ sufficiently large.


Proof. This proof is adapted from [m Lemma 3]. Let $k := \lfloor \log_4 (n/s) \rfloor$ be the
maximal depth at which cells have at least $s$ points in expectation. At most $4^k$
cells exist at depth $k$, defined by at most $2^k$ angular and $2^k$ radial divisions. When
following the circumference of the query circle $A$, each newly cut cell requires
the crossing of an angular or radial division. Each radial and angular coordinate
occurs at most twice on the circle boundary, thus each division can be crossed at
most twice. With two types of divisions, $A$ crosses at most $2 \cdot 2 \cdot 2^k = 4 \cdot 2^{\lfloor \log_4 (n/s) \rfloor}$
cells at depth $k$. Since the value of $4 \cdot 2^{\lfloor \log_4 (n/s) \rfloor}$ is at most $4 \cdot \sqrt{n/s}$, this yields
fewer than $8 \cdot \sqrt{n/s}$ cut cells. We denote the set of cut cells with $\varsigma$. Since the cells in $\varsigma$
cover the circumference of the circle $A$, a subtree $S$ which is cut by $A$ is either
contained within one of the cells in $\varsigma$, corresponds to one of the cells or contains
one. In the first two cases, all points in $S$ are within the cells of $\varsigma$. In the third
case, at least one cell of $\varsigma$ is contained in $S$. As the subtrees are disjoint, this cell
cannot be contained in any other of the considered subtrees. Thus, there are no
more subtrees containing points not in $\varsigma$ than there are cells in $\varsigma$, which are fewer
than $8 \cdot \sqrt{n/s}$ many.

Due to Lemma [2 the probability that a given point is in a given cell at level 
$k$ is $4^{-k}$. The number of points contained in cells of $\varsigma$ thus follows a binomial
distribution $B(n, p)$. An upper bound for the probability $p$ is given by $8\sqrt{s}/\sqrt{n}$, thus
a tail bound for a slightly different distribution $B(n, 8\sqrt{s}/\sqrt{n})$ also holds for $B(n, p)$.
In the proof of m Lemma 7] a similar distribution is considered. Setting the
variable $c$ to $8\sqrt{s}$, we see that the probability of containing more than $16 \cdot \sqrt{s n}$
points is smaller than $0.7^{\sqrt{n}}$.

The subtrees in $\mathcal{S}$ contain at most $s$ points by definition, thus an upper bound
for the number of points in these subtrees is given by $s \cdot 8 \cdot \sqrt{n/s}$ (points not in
$\varsigma$) $+\ 16 \cdot \sqrt{s n}$ (points in $\varsigma$). This results in at most $24 \cdot \sqrt{s n}$ points contained in
$\mathcal{S}$ with probability at least $1 - 0.7^{\sqrt{n}}$. □

The following Lemmas 7 and 8 bound the number of examined candidates
and examined quadtree cells and are used in the proof of Theorem 1.

Lemma 7. Let $T$ be a quadtree with $n$ points and $(q, f)$ a query pair. The number
of candidates examined by a query using subtree aggregation is in $O(|N(q, f)| + \sqrt{n})$
whp.

Proof. For the analysis we consider each probability band $i$ separately. As defined
above, band $i$ contains points with a neighbor probability of $2^{-(i+1)}$ to $2^{-i}$.

Among the cells anchored in band $i$, some are original leaf cells and others are
virtual leaf cells created by subtree aggregation. The virtual leaf cells contain
less than one expected candidate and thus less than $2^{i+1}$ points. The capacity of
the original leaf cells is constant. All the points in cells anchored in band $i$ have
a probability between $2^{-(i+1)}$ and $2^{-i}$ to be a candidate. Among the points in
virtual or original leaf cells, some are in the same band their cell is anchored in,
others are in higher bands.

We divide the set of points within cells anchored in band $i$ into four subsets:

1. points in band $i$ and in original leaf cells

2. points in band $i$ and in virtual leaf cells

3. points not in band $i$ and in original leaf cells

4. points not in band $i$ and in virtual leaf cells

The points in the first two sets are unproblematic. Since the probability
that a point in these sets is a neighbor is at least $2^{-(i+1)}$, the probability for a
given candidate to be a neighbor is at least $\frac{1}{2}$. Due to Lemma 5, the number of
candidates in these sets is in $O(|N(q, f)| + \log n)$ whp, which is in $O(|N(q, f)| + \sqrt{n})$
whp.

Points in the third set are in cells cut by the boundary between band $i$
and band $i + 1$. Since the probabilities are determined by the distance, this
boundary is a circle and we can use Lemma 6 to bound the number of points to
$24\sqrt{n \cdot \mathrm{capacity}}$ with probability at least $1 - 0.7^{\sqrt{n}}$ for $n$ sufficiently large. The
mentioned capacity is the capacity of the original leaf cells.

Likewise, points in the fourth set are in virtual leaf cells cut by the boundary
between bands $i$ and $i + 1$. A virtual leaf cell, which is an aggregated subtree,
contains at most $2^{i+1}$ points, otherwise it would not have been aggregated. Again,
using Lemma 6, we can bound the number of points in these sets to $24\sqrt{n \cdot 2^{i+1}}$
points with probability at least $1 - 0.7^{\sqrt{n}}$.

We denote the union of the third and fourth sets with $\mathrm{Overhang}(q, f, i)$.
From the individual bounds derived in the previous paragraphs, we obtain an
upper bound for the number of points in $\mathrm{Overhang}(q, f, i)$ of $24(\sqrt{n \cdot \mathrm{capacity}} +
\sqrt{n \cdot 2^{i+1}})$ with probability at least $(1 - 0.7^{\sqrt{n}})^2$. Simplifying the bound, we get
that $|\mathrm{Overhang}(q, f, i)| \le 24\sqrt{n} \cdot (2^{(i+1)/2} + \sqrt{\mathrm{capacity}})$ with probability at least
$1 - 2 \cdot 0.7^{\sqrt{n}}$.

Each of the points in $\mathrm{Overhang}(q, f, i)$ is a candidate with a probability
between $2^{-(i+1)}$ and $2^{-i}$. The candidates are sampled independently (see Step 2
in Appendix B.4). While different points may have different probabilities of being
a candidate and the total number of candidates does not follow a binomial
distribution, we can bound the probabilities from above with $2^{-i}$.

We proceed towards a Chernoff bound for the total number of candidates across
all overhangs. Let $X_i$ denote the random variable representing the number of candidates
within $\mathrm{Overhang}(q, f, i)$ and let $X = \sum_{i=0}^{\infty} X_i$ denote the total number of
candidates in overhangs.

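The expected value $E(X)$ follows from the linearity of expectations:

\begin{align}
E(X) &= \sum_{i=0}^{\infty} E(X_i) \tag{85}\\
&\le \sum_{i=0}^{\infty} \left(24\sqrt{n} \cdot \left(2^{(i+1)/2} + \sqrt{\mathrm{capacity}}\right) \cdot 2^{-i}\right) \tag{86}\\
&= 24\sqrt{n} \sum_{i=0}^{\infty} \left(\sqrt{2} \cdot 2^{-i/2} + 2^{-i} \sqrt{\mathrm{capacity}}\right) \tag{87}\\
&= 24\sqrt{n} \left((2\sqrt{2} + 2) + 2\sqrt{\mathrm{capacity}}\right) \tag{88}
\end{align}

(Cells anchored in the band $\infty$, which has an upper bound $b$ of zero for the
neighborhood probability, do not have any candidates and can be omitted here.)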
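
The closed form in Eq. (88) rests on two geometric series and can be confirmed numerically. This sketch uses hypothetical function names and a truncated sum, which is an assumption that holds up because the terms decay geometrically.

```python
import math


def overhang_series(n: int, capacity: int, max_band: int = 200) -> float:
    """Truncated sum over bands of 24*sqrt(n)*(2^((i+1)/2) + sqrt(capacity))*2^-i."""
    total = 0.0
    for i in range(max_band + 1):
        total += (24.0 * math.sqrt(n)
                  * (2.0 ** ((i + 1) / 2.0) + math.sqrt(capacity))
                  * 2.0 ** (-i))
    return total


def overhang_closed_form(n: int, capacity: int) -> float:
    """Right-hand side of Eq. (88)."""
    return 24.0 * math.sqrt(n) * ((2.0 * math.sqrt(2.0) + 2.0)
                                  + 2.0 * math.sqrt(capacity))
```

Both functions should agree to within floating-point precision for any positive `n` and `capacity`.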

Since the candidates are sampled independently with a probability of at most
$2^{-i}$, we can treat $X$ as a sum of independent Bernoulli random variables without
losing generality. This allows us to use a multiplicative Chernoff bound m and
we can now give an upper bound for the probability that the overhangs contain
more than twice as many candidates as expected:

$$\begin{aligned}
\Pr(X > 2E(X)) &\le \left(\frac{e}{4}\right)^{E(X)} \\
&\le \left(\frac{e}{4}\right)^{24\sqrt{n}\left((2\sqrt{2} + 2) + 2\sqrt{\mathrm{capacity}}\right)} \\
&\le \left(\frac{e}{4}\right)^{\sqrt{n}} \\
&\le 0.7^{\sqrt{n}}
\end{aligned} \tag{89--92}$$

While the random variable $X = \sum_{i=0}^{\infty} X_i$ is written as an infinite sum, all
but at most $n$ bands are empty, thus we are only applying the Chernoff bound
over finitely many variables. For each of the at most $n$ non-empty bands, we
defined two tail bounds for the number of points in the overhang. Including this
last bound, we thus have a chain of $2n + 1$ tail bounds, each with a probability
of at least $(1 - 0.7^{\sqrt{n}})$. The event that any of these tail bounds is violated is
a union over each event that a specific tail bound is violated. With a union
bound na Lemma 1.2], the probability that any of the individual tail bounds
is violated is at most $(2n + 1) \cdot 0.7^{\sqrt{n}}$. Since $\frac{1}{(2n+1) \cdot 0.7^{\sqrt{n}}}$ grows faster than $n$
for $n$ sufficiently large, we conclude that the total number of candidates is thus
bounded by $O(|N(q, f)|) + 48\sqrt{n}((2\sqrt{2} + 2) + 2\sqrt{\mathrm{capacity}})$ with probability at least
$(1 - 1/n)$ for $n$ sufficiently large. The leaf capacity is constant, thus the number
of candidates evaluated during execution of a query $(q, f)$ is in $O(|N(q, f)| + \sqrt{n})$
whp. □


We proceed with an auxiliary result necessary for bounding the number of
examined quadtree cells in a query:


Lemma 8. Let $T$ be a quadtree with $n$ points and $(q, f)$ a query pair. The
number of quadtree cells examined by a query using subtree aggregation is in
$O((|N(q, f)| + \sqrt{n}) \log n)$.

To prove Lemma 8, we first introduce another auxiliary lemma:

Lemma 9. Let $D_R$ be a hyperbolic or Euclidean disk of radius $R$ and let $T$ be a
polar quadtree on $D_R$ containing $n$ points distributed according to Section 2.1. Let
$\mathcal{T}(q, f)$ be the set of unaggregated quadtree cells that have only (virtual) leaf cells
as children (category C2 in the proof of Lemma 8). With a query using subtree
aggregation, $|\mathcal{T}(q, f)|$ is in $O(|N(q, f)| + \sqrt{n})$ whp.


Proof. Let $c \in \mathcal{T}(q, f, i)$ be such an unaggregated quadtree cell anchored in band
$i$ that has only original or virtual leaf cells as children. It contains at least $2^i$
points and has four children, of which at least one is also anchored in band $i$. We
denote this (virtual) leaf anchored in band $i$ with $l$. Since each child of $c$ contains
the same probability mass (see Appendix B.1), each point of $c$ is in $l$ with probability
1/4:

$$\Pr(p \in l \mid p \in c) = \frac{1}{4}. \tag{93}$$

A point in $l$ is a candidate (in $l$) with probability $f(\mathrm{dist}(q, l))$, which is between
$2^{-(i+1)}$ and $2^{-i}$ since $l$ is anchored in band $i$. The probability that a given point
$p \in c$ is a candidate in $l$ is then

$$\Pr(p \in l \wedge p \in \mathrm{Candidates}(q, f, i) \mid p \in c) = \frac{1}{4} \cdot f(\mathrm{dist}(q, l)) \ge 2^{-(i+3)} \tag{94}$$

Since the point positions and memberships in $\mathrm{Candidates}(q, f, i)$ are independent,
we can bound the number of candidates in $l$ with a binomial distribution
$B(|c|, 2^{-(i+3)})$. The probability that $l$ contains no candidates is:

\begin{align}
f\left(0, |c|, \tfrac{1}{8} \cdot 2^{-i}\right) &= \left(1 - \tfrac{1}{8} \cdot 2^{-i}\right)^{|c|} \tag{95}\\
&\le \left(1 - \tfrac{1}{8} \cdot 2^{-i}\right)^{2^i} \tag{96}
\end{align}

Considered as a function of $i$, this probability is monotonically ascending. In
the limit of $2^i \to \infty$, it tends to $\exp(-1/8) \approx 0.88$, a value it never exceeds.
The probability that the cell $c$ contains at least one candidate is then above
$1 - \exp(-1/8) > 0.1$.
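
The monotonicity and the limit of the bound in Eq. (96) are easy to verify numerically. A minimal sketch with a hypothetical function name:

```python
import math


def no_candidate_bound(i: int) -> float:
    """Upper bound (1 - 2^-i / 8)^(2^i) on the probability of zero candidates."""
    return (1.0 - 2.0 ** (-i) / 8.0) ** (2 ** i)
```

Evaluating the function for increasing i shows values rising from 0.875 toward exp(-1/8) without exceeding it, so at least one candidate is present with probability above 0.1 in every band.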

For each cell in T, the probability that it contains at least one candidate 
is > 0.1. Let X be the random variable denoting the number of cells in Tthat 
contain at least one candidate. We define an auxiliary binomial distribution 
i3(|T|,0.1) and use a tail bound to estimate the number of cells in Tcontaining 
candidates. Let Y oc i3(|r|,0.1) be a random variable distributed according to 
this auxiliary distribution. 

We use a tail bound from |3] to limit the probability that Y < 0.05|T| to at 
most exp(—|T|/80). Since 0.1 was a lower bound for the probability that a cell 
contains a candidate, this tail bound also holds for X. The probability that the set 
of Tcontains at least 0.05|T| many candidates is then at least (1 — exp(—|T’|/80)). 

We continue with a case distinction: 


If $|T| \in \Omega(\sqrt{n})$: The probability $1 - \exp(-|T|/80)$ is then at least
$1 - \exp(-\sqrt{n}/80)$, which is $> 1 - 1/n$ for sufficiently large $n$. Thus, with high
probability, the number of examined quadtree cells during a query is linear in the
number of candidates. Due to Lemma 0 this is in $O(|N(q, f)| + \sqrt{n})$.


If $|T| \in o(\sqrt{n})$: The cardinality $|T|$ is trivially in $O(\sqrt{n})$.


□ 
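
To give a feeling for "sufficiently large n" in the first case, the following sketch locates the point from which exp(-sqrt(n)/80) < 1/n holds. It assumes the hidden constant in the case condition is 1, that is |T| >= sqrt(n); this assumption and the function names are not taken from the paper.

```python
import math

def whp_condition_holds(n: int) -> bool:
    """True if exp(-sqrt(n)/80) < 1/n, tested in log form: sqrt(n)/80 > ln(n)."""
    return math.sqrt(n) / 80.0 > math.log(n)

def smallest_n_beyond_minimum() -> int:
    """Binary search for the first n > 25600 at which the condition holds.

    sqrt(n)/80 - ln(n) is increasing for n > 160**2 = 25600, so once the
    condition is true beyond that point, it stays true.
    """
    lo, hi = 25_600, 10**7  # the condition fails at lo and holds at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if whp_condition_holds(mid):
            hi = mid
        else:
            lo = mid
    return hi

n_star = smallest_n_beyond_minimum()
for k in range(4, 10):
    print(f"n = 10^{k}: condition holds = {whp_condition_holds(10**k)}")
print("condition holds for all n >=", n_star)
```

Under this assumption the threshold lies between 10^6 and 10^7, so the stated bound is asymptotic rather than a guarantee for small inputs.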


The proof of Lemma then follows easily: 

Proof. We split the set of examined quadtree cells into three categories: 

— leaf cells and root nodes of aggregated subtrees (C1)

— parents of cells in the first category (C2) 

— all other (C3) 

The third category (C3) then exclusively consists of inner nodes in the quadtree. 
When following a chain of nodes in category C3 from the root downwards, it 
ends with a node in category C2. The size $|C3|$ is thus at most $O(|C2| \log n)$ whp,
since the number of elements in a chain cannot exceed the height of the quadtree,
which is $O(\log n)$ by Proposition

With a branching factor of 4, $|C1| = 4|C2|$ holds.

The number of cells in category C2 can be bounded using Lemma to 
$O(|N| + \sqrt{n})$ with high probability. The total number of examined cells is thus in
$O((|N| + \sqrt{n}) \log n)$. □
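
Written out as one chain, the three category bounds combine as follows. This is a restatement of the steps above, not an additional result; the relation between C1 and C2 is used only as an upper bound.

```latex
\begin{align*}
|C1| + |C2| + |C3|
  &\leq 4|C2| + |C2| + O(|C2| \log n) \\
  &= O(|C2| \log n) \\
  &= O\bigl((|N| + \sqrt{n}) \log n\bigr) \quad \text{whp.}
\end{align*}
```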
E Visualizations of Experimental Results



[Figure 4: image not reproduced; recoverable axis label: latitude]



Fig. 4. Left: Twenty-third time step of a simulated disease progression through Germany. 
The colors indicate the number of infected persons within a cell. Right: Random 
hyperbolic graph with 500 nodes and average degree 12. 
F Performance of Baseline Algorithm



[Figure 5: plot not reproduced. Legend entries: sub agg., baseline, impl. of [2], and theoretical fit for baseline, each shown for several values of n.]


Fig. 5. Comparison of running times to generate networks with 10^ to 10® vertices. 
Generating a graph requires n queries. Shown are running times of the baseline algorithm, 
queries using subtree aggregation and the implementation of [2]. The theoretical fit is
given by the equation T(n, m) = (7.94 · … + 4.1 · 10^{-…} · n) seconds. The baseline
algorithm is still faster than the previous implementation [2], but much slower than the
improved query using subtree aggregation. 
G Fast RHG Generator vs Reference Implementation [2]
Fig. 6. Comparison of clustering coefficients, degree assortativity and measured vs
desired power-law exponent γ. Shown are the implementation of [2] (left) and our
implementation (right). The clustering coefficient describes the ratio of closed triangles 
to triads in a graph. Degree assortativity describes whether vertices have neighbors of 
similar degree. The degree distribution of random hyperbolic graphs follows a power 
law, whose exponent γ can be adjusted. In the degree distribution plot, the blue curve
is almost always identical to the red curve and thus covered by it. Values are averaged 
over 80 runs. 